Data Loading and Quality Check¶

Information about some column values:

  • Open : an indicator for whether the store was open: 0 = closed, 1 = open

  • StateHoliday : indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None

  • StoreType : differentiates between 4 different store models: a, b, c, d

  • Promo2 : Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating

  • Assortment : describes an assortment level: a = basic, b = extra, c = extended

In [46]:
import pandas as pd
from utils.loaders import load_config
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import plotly.io as pio
pio.renderers.default = "notebook"

conf = load_config("../config/config.yaml")
Loading the training file
In [47]:
df = pd.read_csv(conf["data"]["train"], parse_dates=["Date"])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1017209 entries, 0 to 1017208
Data columns (total 9 columns):
 #   Column         Non-Null Count    Dtype         
---  ------         --------------    -----         
 0   Store          1017209 non-null  int64         
 1   DayOfWeek      1017209 non-null  int64         
 2   Date           1017209 non-null  datetime64[ns]
 3   Sales          1017209 non-null  int64         
 4   Customers      1017209 non-null  int64         
 5   Open           1017209 non-null  int64         
 6   Promo          1017209 non-null  int64         
 7   StateHoliday   1017209 non-null  object        
 8   SchoolHoliday  1017209 non-null  int64         
dtypes: datetime64[ns](1), int64(7), object(1)
memory usage: 69.8+ MB
C:\Users\ahmed\AppData\Local\Temp\ipykernel_17168\4188774538.py:1: DtypeWarning:

Columns (7) have mixed types. Specify dtype option on import or set low_memory=False.

  • Column 7 (StateHoliday) has mixed types: the integer 0 and the strings a/b/c. We cast it to a single string dtype below.
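An alternative (not used here) is to declare the column's dtype at load time, which avoids the DtypeWarning entirely; a sketch with a small in-memory sample standing in for the real file:

```python
import io
import pandas as pd

# A tiny in-memory sample standing in for the real train.csv.
csv = io.StringIO(
    "Store,Date,Sales,StateHoliday\n"
    "1,2015-07-31,5263,0\n"
    "2,2015-01-01,0,a\n"
)

# Declaring StateHoliday as str up front prevents the mixed-type
# inference that triggers the DtypeWarning on the full file.
df = pd.read_csv(csv, parse_dates=["Date"], dtype={"StateHoliday": str})
print(df["StateHoliday"].tolist())  # ['0', 'a'] — all strings
```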
In [48]:
for col in df.columns:
    inferred_type = pd.api.types.infer_dtype(df[col])
    print(f"Column '{col}': Inferred type is '{inferred_type}'")
Column 'Store': Inferred type is 'integer'
Column 'DayOfWeek': Inferred type is 'integer'
Column 'Date': Inferred type is 'datetime64'
Column 'Sales': Inferred type is 'integer'
Column 'Customers': Inferred type is 'integer'
Column 'Open': Inferred type is 'integer'
Column 'Promo': Inferred type is 'integer'
Column 'StateHoliday': Inferred type is 'mixed-integer'
Column 'SchoolHoliday': Inferred type is 'integer'
In [49]:
df["StateHoliday"] = df["StateHoliday"].astype(str)
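As a side note, StateHoliday (like StoreType and Assortment) has very few distinct labels, so a category dtype would cut memory noticeably; a quick sketch, not applied to the dataset here:

```python
import pandas as pd

# A low-cardinality object column: four labels repeated many times.
s_obj = pd.Series(["0", "a", "b", "c"] * 1000, dtype="object")

# A category stores one copy of each label plus small integer codes.
s_cat = s_obj.astype("category")

print(s_cat.memory_usage(deep=True) < s_obj.memory_usage(deep=True))  # True
```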
In [50]:
store_df = pd.read_csv(conf["data"]["store"])
store_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1115 entries, 0 to 1114
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Store                      1115 non-null   int64  
 1   StoreType                  1115 non-null   object 
 2   Assortment                 1115 non-null   object 
 3   CompetitionDistance        1112 non-null   float64
 4   CompetitionOpenSinceMonth  761 non-null    float64
 5   CompetitionOpenSinceYear   761 non-null    float64
 6   Promo2                     1115 non-null   int64  
 7   Promo2SinceWeek            571 non-null    float64
 8   Promo2SinceYear            571 non-null    float64
 9   PromoInterval              571 non-null    object 
dtypes: float64(5), int64(2), object(3)
memory usage: 87.2+ KB
In [51]:
merged_df = pd.merge(df, store_df, on="Store", how="inner")
merged_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1017209 entries, 0 to 1017208
Data columns (total 18 columns):
 #   Column                     Non-Null Count    Dtype         
---  ------                     --------------    -----         
 0   Store                      1017209 non-null  int64         
 1   DayOfWeek                  1017209 non-null  int64         
 2   Date                       1017209 non-null  datetime64[ns]
 3   Sales                      1017209 non-null  int64         
 4   Customers                  1017209 non-null  int64         
 5   Open                       1017209 non-null  int64         
 6   Promo                      1017209 non-null  int64         
 7   StateHoliday               1017209 non-null  object        
 8   SchoolHoliday              1017209 non-null  int64         
 9   StoreType                  1017209 non-null  object        
 10  Assortment                 1017209 non-null  object        
 11  CompetitionDistance        1014567 non-null  float64       
 12  CompetitionOpenSinceMonth  693861 non-null   float64       
 13  CompetitionOpenSinceYear   693861 non-null   float64       
 14  Promo2                     1017209 non-null  int64         
 15  Promo2SinceWeek            509178 non-null   float64       
 16  Promo2SinceYear            509178 non-null   float64       
 17  PromoInterval              509178 non-null   object        
dtypes: datetime64[ns](1), float64(5), int64(8), object(4)
memory usage: 139.7+ MB
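Since every daily row should match exactly one store record, the merge can be guarded with `validate` and `indicator` (standard `pandas.merge` parameters); a sketch on toy frames:

```python
import pandas as pd

# Toy stand-ins for the daily frame and the per-store frame.
daily = pd.DataFrame({"Store": [1, 1, 2], "Sales": [5263, 6064, 8314]})
stores = pd.DataFrame({"Store": [1, 2], "StoreType": ["c", "a"]})

# validate="many_to_one" raises MergeError if Store is duplicated on
# the right; indicator=True flags rows that found no match.
merged = pd.merge(
    daily, stores, on="Store", how="left",
    validate="many_to_one", indicator=True,
)
print((merged["_merge"] == "both").all())  # True — every row matched
```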
In [52]:
train_df = merged_df[merged_df["Open"] == 1].copy()
train_df.head(20)
Out[52]:
Store DayOfWeek Date Sales Customers Open Promo StateHoliday SchoolHoliday StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear PromoInterval
0 1 5 2015-07-31 5263 555 1 1 0 1 c a 1270.0 9.0 2008.0 0 NaN NaN NaN
1 2 5 2015-07-31 6064 625 1 1 0 1 a a 570.0 11.0 2007.0 1 13.0 2010.0 Jan,Apr,Jul,Oct
2 3 5 2015-07-31 8314 821 1 1 0 1 a a 14130.0 12.0 2006.0 1 14.0 2011.0 Jan,Apr,Jul,Oct
3 4 5 2015-07-31 13995 1498 1 1 0 1 c c 620.0 9.0 2009.0 0 NaN NaN NaN
4 5 5 2015-07-31 4822 559 1 1 0 1 a a 29910.0 4.0 2015.0 0 NaN NaN NaN
5 6 5 2015-07-31 5651 589 1 1 0 1 a a 310.0 12.0 2013.0 0 NaN NaN NaN
6 7 5 2015-07-31 15344 1414 1 1 0 1 a c 24000.0 4.0 2013.0 0 NaN NaN NaN
7 8 5 2015-07-31 8492 833 1 1 0 1 a a 7520.0 10.0 2014.0 0 NaN NaN NaN
8 9 5 2015-07-31 8565 687 1 1 0 1 a c 2030.0 8.0 2000.0 0 NaN NaN NaN
9 10 5 2015-07-31 7185 681 1 1 0 1 a a 3160.0 9.0 2009.0 0 NaN NaN NaN
10 11 5 2015-07-31 10457 1236 1 1 0 1 a c 960.0 11.0 2011.0 1 1.0 2012.0 Jan,Apr,Jul,Oct
11 12 5 2015-07-31 8959 962 1 1 0 1 a c 1070.0 NaN NaN 1 13.0 2010.0 Jan,Apr,Jul,Oct
12 13 5 2015-07-31 8821 568 1 1 0 0 d a 310.0 NaN NaN 1 45.0 2009.0 Feb,May,Aug,Nov
13 14 5 2015-07-31 6544 710 1 1 0 1 a a 1300.0 3.0 2014.0 1 40.0 2011.0 Jan,Apr,Jul,Oct
14 15 5 2015-07-31 9191 766 1 1 0 1 d c 4110.0 3.0 2010.0 1 14.0 2011.0 Jan,Apr,Jul,Oct
15 16 5 2015-07-31 10231 979 1 1 0 1 a c 3270.0 NaN NaN 0 NaN NaN NaN
16 17 5 2015-07-31 8430 946 1 1 0 1 a a 50.0 12.0 2005.0 1 26.0 2010.0 Jan,Apr,Jul,Oct
17 18 5 2015-07-31 10071 936 1 1 0 1 d c 13840.0 6.0 2010.0 1 14.0 2012.0 Jan,Apr,Jul,Oct
18 19 5 2015-07-31 8234 718 1 1 0 1 a c 3240.0 NaN NaN 1 22.0 2011.0 Mar,Jun,Sept,Dec
19 20 5 2015-07-31 9593 974 1 1 0 0 d a 2340.0 5.0 2009.0 1 40.0 2014.0 Jan,Apr,Jul,Oct
In [53]:
missing_values = train_df.isnull().sum()
percentage_missing = (missing_values / len(train_df)) * 100
percentage_missing
Out[53]:
Store                         0.000000
DayOfWeek                     0.000000
Date                          0.000000
Sales                         0.000000
Customers                     0.000000
Open                          0.000000
Promo                         0.000000
StateHoliday                  0.000000
SchoolHoliday                 0.000000
StoreType                     0.000000
Assortment                    0.000000
CompetitionDistance           0.258884
CompetitionOpenSinceMonth    31.812120
CompetitionOpenSinceYear     31.812120
Promo2                        0.000000
Promo2SinceWeek              50.131574
Promo2SinceYear              50.131574
PromoInterval                50.131574
dtype: float64
In [54]:
px.box(train_df,x="Sales")

Conclusion:

  • Promo2SinceWeek, Promo2SinceYear, and PromoInterval are ~50% missing ~ analyze their importance before deciding to exclude or impute
  • All columns now have correct types
  • We have some outliers, the most extreme above 40k, likely caused by promotions ~ needs to be checked
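The ~50% missing share of the Promo2 detail columns matches the share of rows with Promo2 = 0, so "missing" plausibly means "not participating". If those columns are kept, one possible imputation sketch (a modeling choice, not a given; values chosen for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame: rows with Promo2 == 0 carry NaN in the Promo2 detail columns.
toy = pd.DataFrame({
    "Promo2": [0, 1, 0],
    "Promo2SinceWeek": [np.nan, 13.0, np.nan],
    "Promo2SinceYear": [np.nan, 2010.0, np.nan],
    "PromoInterval": [np.nan, "Jan,Apr,Jul,Oct", np.nan],
})

# Read NaN as "not participating": 0 for the numeric fields and a
# "None" label for the interval.
filled = toy.fillna({
    "Promo2SinceWeek": 0,
    "Promo2SinceYear": 0,
    "PromoInterval": "None",
})
print(int(filled.isnull().sum().sum()))  # 0 — no missing values remain
```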

Data Analysis for Target Variable¶

In [55]:
train_df["Month"] = train_df["Date"].dt.month
train_df["Year"] = train_df["Date"].dt.year
In [56]:
train_df.groupby(["StoreType"]).Store.count()
Out[56]:
StoreType
a    457077
b     15563
c    112978
d    258774
Name: Store, dtype: int64
  • Type b stores are far less frequent than the other store types
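Note that `Store.count()` here counts open-day rows, not distinct stores; `nunique()` separates the two. A sketch on a toy frame:

```python
import pandas as pd

# Toy frame: two type-a stores with multiple open days, one type-b store.
toy = pd.DataFrame({
    "StoreType": ["a", "a", "a", "b"],
    "Store": [1, 1, 2, 3],
})

rows = toy.groupby("StoreType").Store.count()      # open-day rows per type
stores = toy.groupby("StoreType").Store.nunique()  # distinct stores per type
print(rows.to_dict(), stores.to_dict())  # {'a': 3, 'b': 1} {'a': 2, 'b': 1}
```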
In [57]:
train_df.groupby(["StoreType","Assortment"]).Store.count()
Out[57]:
StoreType  Assortment
a          a             286053
           c             171024
b          a               6409
           b               8212
           c                942
c          a              58561
           c              54417
d          a              93886
           c             164888
Name: Store, dtype: int64
  • Stores of type b are the only ones with assortment type b (extra)
In [58]:
px.histogram(train_df,x="Sales")
  • Daily sales are concentrated in the 3k–8k range, but the distribution is right-skewed, with extreme peaks above 20k and up to ~40k, likely driven by promotions
In [59]:
avg_sales_day_df = train_df.groupby(["Date"]).Sales.mean().reset_index()
px.line(avg_sales_day_df, x="Date", y="Sales")
  • Sales follow a predictable weekly cycle, with recurring peaks and an especially strong surge in December. A sharp drop occurs each January, consistent with post-holiday demand normalization. Promotional spikes occur throughout the year, amplifying variability
In [60]:
px.box(train_df,x="Sales")
  • Most daily sales fall between ~4.5k and 8k (IQR), with a median near 6k. However, extreme outliers are present, including one day above 40k, likely linked to exceptional promotional activity.
In [61]:
px.box(train_df,x="Sales", color="DayOfWeek")
  • Sales show a clear weekly pattern: weekdays are stable but lower, while Sundays and Mondays drive significantly higher sales, albeit with more variability. This suggests staffing and inventory planning should be adjusted to meet peak demand at the start and end of the week.
In [62]:
avg_month_sales = train_df.groupby(["Year","Month"]).Sales.mean().reset_index()
px.line(avg_month_sales,x="Month",y="Sales", color="Year")
  • Sales follow a strong annual cycle: demand peaks in December and drops in January, while mid-year sales remain low and stable. This implies clear seasonality
In [63]:
avg_sales_Promo_df = train_df.groupby(["Promo"]).Sales.mean().reset_index()
px.bar(avg_sales_Promo_df,y="Sales",x="Promo",text_auto=".2f")
In [64]:
avg = avg_sales_Promo_df.set_index("Promo")["Sales"]
promo_uplift = (avg.loc[1] - avg.loc[0]) / avg.loc[0] * 100
print(f"{promo_uplift:.2f}%")
38.77%
  • Promotions are highly effective, lifting average sales by ~39% compared with non-promo days.
In [65]:
promo_sales_store = train_df.groupby(["StoreType","Promo"])["Sales"].mean().unstack()

promo_sales_store["uplift_pct"] = (promo_sales_store[1] - promo_sales_store[0]) / promo_sales_store[0] * 100
promo_sales_store
Out[65]:
Promo 0 1 uplift_pct
StoreType
a 5808.500091 8304.102317 42.964658
b 9566.861686 11307.994108 18.199619
c 6028.063255 8041.995526 33.409276
d 5855.268289 8017.828656 36.933583
  • Promotions are most effective in StoreType a, while StoreType b already drives high sales with only a modest promo uplift. This suggests tailoring promo spend: invest more in a/c/d, and rethink the promo strategy in b, where uplift is limited.
In [66]:
promo_sales_month = train_df.groupby(["Month","Promo"])["Sales"].mean().unstack()
promo_sales_month["uplift_pct"] = (promo_sales_month[1] - promo_sales_month[0]) / promo_sales_month[0] * 100
promo_sales_month
Out[66]:
Promo 0 1 uplift_pct
Month
1 5591.904236 7782.800277 39.179785
2 5747.812834 7772.909263 35.232470
3 5805.723545 8252.983345 42.152538
4 5855.305439 8426.029528 43.904184
5 6300.749699 8087.946860 28.364833
6 5871.747172 8508.684577 44.908906
7 5751.911514 8257.588809 43.562515
8 5680.220869 7926.496229 39.545564
9 5528.634084 7916.399032 43.189057
10 5845.745508 7646.399697 30.802815
11 6257.749173 8163.897052 30.460599
12 7318.434182 10490.790244 43.347470
In [67]:
avg_sales_day_Promo_df = train_df.groupby(["Date","Promo"]).Sales.mean().reset_index()
px.line(avg_sales_day_Promo_df,x="Date",y="Sales", color="Promo")
  • Promos are consistently effective year-round, but their relative impact is strongest in spring/summer (when baseline demand is lower) and in December (amplifying the holiday peak). October–November promos show diminishing returns
In [68]:
avg_sales_promo2_df = train_df.groupby(["Promo2"]).Sales.mean().reset_index()
px.bar(avg_sales_promo2_df,y="Sales",x="Promo2",text_auto=".2f")
In [69]:
avg_sales_school_df = train_df.groupby(["SchoolHoliday"]).Sales.mean().reset_index()
px.bar(avg_sales_school_df,y="Sales",x="SchoolHoliday",text_auto=".2f")
  • School holidays have little to no impact on sales
In [70]:
avg_sales_state_df = train_df.groupby(["StateHoliday"]).Sales.mean().reset_index()
px.bar(avg_sales_state_df,y="Sales",x="StateHoliday",text_auto=".2f")
In [71]:
baseline = avg_sales_state_df.loc[avg_sales_state_df["StateHoliday"] == "0", "Sales"].iloc[0]
avg_sales_state_df.Sales.apply(lambda x: (x - baseline) / baseline * 100)
Out[71]:
0     0.000000
1    22.060154
2    42.199875
3    40.126920
Name: Sales, dtype: float64
  • Public holidays increase sales by ~22% compared to normal days.

  • Easter (+42%) and Christmas (+40%) are the biggest demand drivers.

In [72]:
avg_sales_store_df = train_df.groupby(["StoreType"]).Sales.mean().reset_index()
px.bar(avg_sales_store_df,y="Sales",x="StoreType",text_auto=".2f")
  • StoreType b has the highest average sales of all types. Why?
In [73]:
# CompetitionDistance spans several orders of magnitude, from tens of meters to tens of kilometers, so a log scale captures the variation more realistically
train_df["CompDist_log"] = np.log1p(train_df["CompetitionDistance"])
In [74]:
distance_from_competition = train_df.groupby(["StoreType"]).CompDist_log.mean().reset_index()
px.bar(distance_from_competition, y="CompDist_log", x="StoreType", text_auto=".2f")
  • StoreType b stores appear to sit in high-traffic retail clusters, with competitors (and hence more customers) nearby
In [75]:
assor_store = train_df.groupby(["StoreType","Assortment"]).Sales.mean().reset_index()
px.bar(assor_store,x="StoreType",y="Sales", color="Assortment")
  • StoreType b consistently outperforms others in daily sales, driven by broader assortments (extended/extra). Other store types (a/c/d) perform similarly, with assortment mix making only modest differences.
In [76]:
temp = train_df.groupby(["StoreType"]).Sales.sum().reset_index()
px.bar(temp,x="StoreType",y="Sales")
  • While type b stores have higher average sales, they contribute little to overall sales because there are few of them
In [77]:
avg_sales_dist_df = train_df.groupby(["CompDist_log"]).Sales.mean().reset_index()
px.scatter(avg_sales_dist_df, x="CompDist_log", y="Sales")
  • CompetitionDistance has little predictive power on its own: stores can succeed both near and far from competitors. This suggests that store type, assortment, and promotions are much stronger drivers of sales than competitor proximity
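The "little predictive power" reading can be quantified with a Pearson coefficient; a sketch on synthetic data where sales are drawn independently of distance (seeded for reproducibility; the real check would use train_df):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: sales generated independently of log-distance.
toy = pd.DataFrame({
    "CompDist_log": rng.uniform(3, 11, size=500),
    "Sales": rng.normal(7000, 2000, size=500),
})

# A Pearson coefficient near zero supports the weak-relationship reading.
r = toy["CompDist_log"].corr(toy["Sales"])
print(round(r, 3))
```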
In [78]:
avg_sales_assor_df = train_df.groupby(["Assortment"]).Sales.mean().reset_index()
px.bar(avg_sales_assor_df,y="Sales",x="Assortment",text_auto=".2f")
  • Assortment b (extra) yields higher average sales than the other assortment levels
In [79]:
px.scatter(train_df,x="Customers",y="Sales", color="StoreType", opacity=0.5)
  • StoreType d achieves higher sales per customer, while Type b relies on higher traffic but with lower spend per visit.
In [80]:
avg_basket_size = (train_df["Sales"] / train_df["Customers"]).groupby(train_df["StoreType"]).mean()
print(avg_basket_size)
StoreType
a     8.846277
b     5.133097
c     8.626227
d    11.277862
dtype: float64

Data Preprocessing¶